Fix the Orchagent crash seen during Port channel OC test cases (Issue#17665) #3042

Merged


saksarav-nokia
Contributor

What I did
Modified the addNextHop function to pass the alias of the remote system port instead of the Inband port when the neighbor is a remote system neighbor.
Why I did it
Fix for sonic-net/sonic-buildimage#17665
The addNeighbor function adds a remote system neighbor against the remote system port and increments the reference count of the remote system port's RIF.
However, addNextHop adds the next hop against the Inband port with the RIF ID of the remote system port, and increments the RIF reference count of the Inband port instead of the remote system port.
On removal, removeNeighbor decrements the RIF reference count of the remote system port, and removeNextHop also decrements the reference count of the remote system port.
So if the remote system port has both IPv4 and IPv6 configured, the reference count is incremented by 2 on the remote system port's RIF (IPv4 and IPv6 neighbors) and by 2 on the Inband port's RIF (IPv4 and IPv6 next hops), but it is decremented 4 times on the remote system port's RIF.
As a result, as soon as the IPv4 or IPv6 configuration is deleted, orchagent sometimes tries to delete the remote system port's RIF. Because the SAI meta layer holds a different reference count, the delete fails and orchagent crashes.
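
The accounting described above can be reproduced with a toy ledger. This is a minimal sketch, not the orchagent code: the class `RifRefCounts`, the aliases `remote_system_port` and `inband_port`, and the `fixed` flag are all hypothetical names introduced for illustration. It only models which RIF each call increments or decrements, as stated in this description.

```python
from collections import Counter

# Hypothetical aliases, used only as keys in this toy model.
REMOTE = "remote_system_port"
INBAND = "inband_port"


class RifRefCounts:
    """Toy ledger of RIF reference counts keyed by port alias."""

    def __init__(self, fixed):
        # fixed=True models the behaviour after this PR (assumption:
        # the next hop is counted on the remote system port's RIF).
        self.fixed = fixed
        self.refs = Counter()

    def add_neighbor(self):
        # The neighbor is counted on the remote system port's RIF.
        self.refs[REMOTE] += 1

    def add_next_hop(self):
        # Before the fix, the next hop was counted on the Inband port's RIF.
        self.refs[REMOTE if self.fixed else INBAND] += 1

    def remove_neighbor(self):
        self.refs[REMOTE] -= 1

    def remove_next_hop(self):
        # Always released on the remote system port's RIF.
        self.refs[REMOTE] -= 1


def remote_refs_after_ipv4_removal(fixed):
    """Add IPv4 and IPv6 neighbor/next hop pairs, then remove only IPv4."""
    ledger = RifRefCounts(fixed)
    for _family in ("ipv4", "ipv6"):
        ledger.add_neighbor()
        ledger.add_next_hop()
    ledger.remove_next_hop()
    ledger.remove_neighbor()
    return ledger.refs[REMOTE]


print(remote_refs_after_ipv4_removal(fixed=False))  # prints 0
print(remote_refs_after_ipv4_removal(fixed=True))   # prints 2
```

In the unfixed case the remote system port's RIF count drops to 0 while the IPv6 neighbor and next hop still exist, which matches the point at which orchagent attempts the RIF delete. With the next hop counted on the remote system port, the count stays at 2 until the IPv6 entries are removed as well.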

How I verified it
Ran the Port channel OC suites and the DB consistency OC suites multiple times and verified that the orchagent crash no longer occurs.


Signed-off-by: saksarav <[email protected]>
@saksarav-nokia
Contributor Author

This needs to be cherry-picked to 202205

@saksarav-nokia
Contributor Author

@prsunny please review this

@saksarav-nokia
Contributor Author

@judyjoseph for viz

@gechiang
Contributor

gechiang commented Feb 6, 2024

MSFT ADO: 26718860

@yxieca, @StormLiangMS, this is a bug fix that needs to be backported to 202205 and 202305.
There is no MSFT repo for SWSS for 202205, and the Chassis community is impacted without this fix. It is a self-contained one-line fix that only affects the Chassis platform and not the pizzabox platform.
Thanks!

mssonicbld pushed a commit to mssonicbld/sonic-swss that referenced this pull request Feb 7, 2024
…c-net#3042)

@mssonicbld
Collaborator

Cherry-pick PR to 202311: #3043

mssonicbld pushed a commit to mssonicbld/sonic-swss that referenced this pull request Feb 7, 2024
…c-net#3042)

@mssonicbld
Collaborator

Cherry-pick PR to 202205: #3044

mssonicbld pushed a commit that referenced this pull request Feb 8, 2024
prsunny pushed a commit that referenced this pull request Feb 9, 2024
… (#3044)

@gechiang
Contributor

gechiang commented Mar 7, 2024

@StormLiangMS I see this was approved for 202305, but I don't see that it was ever picked into 202305. Can you help take a look at what happened? Perhaps auto cherry-pick was disabled for 202305?

mssonicbld pushed a commit to mssonicbld/sonic-swss that referenced this pull request Mar 8, 2024
…c-net#3042)

@mssonicbld
Collaborator

Cherry-pick PR to 202305: #3072

mssonicbld pushed a commit that referenced this pull request Mar 8, 2024
7 participants